Analysis of complete chloroplast genome sequences and insight into the phylogenetic relationships of Ferula L

Background Ferula L. is one of the largest and most taxonomically complicated genera as well as being an important medicinal plant resource in the family Apiaceae. To investigate the plastome features and phylogenetic relationships of Ferula and its neighboring genera Soranthus Ledeb., Schumannia Kuntze., and Talassia Korovin, we sequenced 14 complete plastomes of 12 species. Results The size of the 14 complete chloroplast genomes ranged from 165,607 to 167,013 base pairs (bp) encoding 132 distinct genes (87 protein-coding, 37 tRNA, and 8 rRNA genes), and showed a typical quadripartite structure with a pair of inverted repeats (IR) regions. Based on comparative analysis, we found that the 14 plastomes were similar in codon usage, repeat sequence, simple sequence repeats (SSRs), and IR borders, and had significant collinearity. Based on our phylogenetic analyses, Soranthus, Schumannia, and Talassia should be considered synonymous with Ferula. Six highly divergent regions (rps16/trnQ-UUG, trnS-UGA/psbZ, psbH/petB, ycf1/ndhF, rpl32, and ycf1) were also detected, which may represent potential molecular markers, and combined with selective pressure analysis, the weak positive selection gene ccsA may be a discriminating DNA barcode for Ferula species. Conclusion Plastids contain abundant informative sites for resolving phylogenetic relationships. Combined with previous studies, we suggest that there is still much room for improvement in the classification of Ferula. Overall, our study provides new insights into the plastome evolution, phylogeny, and taxonomy of this genus. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08868-z.

Chloroplasts are independent organelles in plant cells that have their own complete set of genomes and typically covalently closed circular DNA, which exists in cells as multiple copies [30]. The chloroplast genomes of higher plants have a highly conserved tetrad structure involving inverted repeat sequences (IRs) and large single-copy (LSC) and small single-copy (SSC) regions [31]. Chloroplast genomes are relatively conserved in terms of gene number and sequence in terrestrial plants [32]. The sizes of chloroplast genomes are generally within the range of 115-165 kb, and genome size variation is mainly affected by reverse repeat length variation. Additionally, chloroplast genomes usually exhibit uniparental inheritance and low nucleotide substitution rates [33]. At present, chloroplast genome sequences and nuclear genome sequences can be obtained using shallow whole genome sequencing technology. This is considered an effective means of improving the rate of species identification and has been developed as a tool for plant phylogenetic studies at different taxonomic levels [34][35][36][37][38][39][40][41][42]. For example, the complete plastomes and nrDNA sequences obtained based on shallow genome sequencing have greatly improved the species identification rate of Rhododendron, which is also difficult to classify [43]. Thus, the complete plastomes might insight into the phylogenetic relationships of Ferula and its neighboring genera.
Here, we used plastomes to infer the phylogenetic relationships between Ferula and its confused neighboring genera. Fourteen newly sequenced plastomes of Ferula (including Soranthus, Schumannia, and Talassia) were analyzed to (1) conduct comprehensive research on the Ferula chloroplast genome; (2) identify hotspot regions, microsatellite types, and comparative genomic divergence; (3) analyze the relationships between Ferula, Soranthus, Schumannia, and Talassia based on their complete chloroplast genomes; and (4) serve as a reference for subsequent phylogenomic studies of the genus Ferula.

Chloroplast genome features
The 14 complete cp genomes ranged from 165,607 to 167,013 bp. Newly sequenced Ferula chloroplast genome maps are shown in Fig. 1. All cp genomes possessed the typical quadripartite structure of angiosperms, consisting of a pair of inverted repeat regions (IRs: 31,392-31,880 bp) and a circular molecular structure ( Fig. 1; Table 1). All 14 cp genomes possessed 133 distinct genes arranged in the same order, including 87 proteincoding genes, 37 tRNA genes, and eight rRNA genes. Of these, 14 protein-coding genes and eight tRNAs contained at least one intron. The genes were classified   Fig. 1 Chloroplast genome maps for Ferula L. Genes on the inside of the circle are transcribed clockwise and those on the outside are transcribed counterclockwise. The darker gray inner circle corresponds to the GC content, whereas the lighter gray indicates the AT content. Different colors represent different functional genes into the following four groups based on their functions: (1) 74 self-replication genes; (2) 45 photosynthesisrelated genes (in Rubisco, ATP synthase, Photosystem I, cytochrome b/f complex, photosystem II, and NADH dehydrogenase groups); and 13 other genes including (3) six genes with known functions (matK, cemA, accD, ccsA, infA, and clpP) and (4) seven genes with unknown functions (ycf1(2), ycf2 (2), ycf3, ycf4, and ycf15) ( Table 2). The total GC content for 12 sequenced species was 37.8-38.0% (Table 1).

Codon usage
The RSCU values of all codons are shown in Fig. 2 in the form of a heatmap; the red values indicate higher RSCU values, and the blue values indicate lower RSCU values. For Ferula species, the most commonly used transcription initiation codon was AUG, the most commonly used termination codon was UAA, and the initiation codon AUU only existed in F. olivacea. Except for the initiation codon and termination codon, the most used transcription codon was UTA, and AGC showed the lowest RSCU values; the most abundant amino acid (AA) was leucine, while cysteine was the lowest frequency AA. Except for tryptophan, all AAs had more than one synonymous codon, and three AAs (leucine, serine, and arginine) had the most (six) synonymous codons. The use of one codon, UGG, showed no bias (RSCU = 1) (Table S2).

Repeat structure analysis
Forward, palindromic, reverse, and complementary repeats were detected in 14 Ferula plastomes. Except for IR repeats, 837 repeats were identified in total; the numbers of forward repeats (398) and palindromic repeats (421) were much higher than the complement repeats (7) and reverse repeats (11). Reverse and complementary repeats were missing in four samples (F. sibirica 1, F. kelifi, F. ovina, and F. karelinii 3). F. kelifi contained the maximum number of repeats (94), whereas F. equisetacea and F. olivacea contained the least (46) (Table S3). A total of 1,061 SSRs were identified in the 14 species, six of which did not have pentanucleotides, and hexanucleotides were only found in F. olivacea. Additionally, mononucleotides were most frequent followed by dinucleotides, tetranucleotides, trinucleotides, pentanucleotides, and hexanucleotides. F. transiliensis-1 contained the highest number of SSRs (82), whereas F. oopoda contained the least (69). Poly (A/T) SSRs were typically most common, while poly (C/G) repeats were extremely rare ( Table S4).

Comparisons of border and sequence identity
Single-copy and inverted repeat borders were examined; F. kelifi and F. equisetacea harbored the longest (31,880 bp) and shortest (31,392 bp) IR regions, respectively. Among all 14 Ferula species, rps19 is embedded in the LSC/IRb junction region and only 81 bp with the IRb overlap; ycf1 spans SSC/IRa and occupies a long section in both regions; and trnH occurs in the LSC region and is only 5 bp away from IRa, except for F. sibirica 3 (11 bp). The variety of IRb/SSC is relatively high, most (or all) of which occur in the SSC region, and the overlap with the IRb region varied from -18 to 16 bp (Fig. 3).
According to the sequence identity plots, the 14 sequences were almost identical in their genetic structure and showed a very high degree of  Table S5), yielding a maximum value of 0.01019 in ycf1. The SSC area showed the maximum nucleotide diversity followed by the LSC region, and the IR regions had the lowest Pi value. Additionally, six highly divergent regions (> 0.006) were detected in the LSC region (rps16/trnQ-UUG , trnS-UGA /psbZ, psbH/petB), SSC region (ycf1/ndhF, rpl32, ycf1), and IR region (0). We calculated the Ka/Ks ratios of the 79 common protein-coding genes to reveal selection patterns among the protein-coding genes. The Ka/Ks ratios of most of the genes were less than 0.5 or could not be computed because either the Ka or Ks value was zero; three genes (ccsA, ndhC, and ycf2) had values greater than 1; and the total Ka/Ks ratio of all genes was 0.5331 (Table S6). In addition, we found several annotation errors (ndhH and ccsA) in the previously reported sequences of F. sinkiangensis (MW411057).
The ML and BI topologies were highly supported. Ten selected genera formed 10 monophyletic groups, all of which had support values of 100 or 1 in the ML and BI trees, respectively. Ferula was divided into three main lineages (A, B and C) with maximal support (PP = 1, BS ≥ 97%), and three genera (Soranthus meyeri,
Although the IR region is thought to be the most conserved region in the chloroplast genome, contraction and expansion of the IR region is common, and is the main reason for the variation in chloroplast genome size [44][45][46]. The junction of IRb/LSC located at ycf2 gene is defined as the type without any expansion or contraction [47]. In this study, we observed that 14 sequenced complete plastomes exhibited significant IR expansion (Fig. 3). All the species expanded into rps19 at the IRb/ LSC junction region, contributing to rps19 fragment in the IRa/LSC region, and they also expanded into ycf1 at the IRb/SSC junction region, leading to an overlap between the ycf1 pseudo-gene and ndhF. This was consistent with previous studies, in which the pseudogenes ycf1 and rps19 were produced by contraction and expansion of the IR region in angiosperms [48][49][50].
RSCU value is the ratio of specific codon usage frequency to desired frequency, which can eradicate the influence of amino acid composition on codon usage and promotes the detection of synonymous codons [51,52]. Generally, the content of A/T was higher than that of G/C in plastomes codons and A/T is preferred in the third codon position [53], the bias also showed in the Ferula plastomes (Fig. 2). Leucine was encoded by 6 codons, the order of codon preference was UTA > CUT > UTG > CUA > CUC > CUG, which following previous studies [54,55]. The analysis of RSCU can provide a basis for studying the specific mechanism of synonymous codon bias preference in different species, which plays a crucial role in molecular biology basis research [56,57].
As a primary source of molecular markers, SSRs have been widely used in Ferula genetic diversity studies because of their high polymorphism rate and abundant variation at the species level [58,59]. In our study, we identified 837 repeats (Table S3) and 1,061 SSRs (Table  S4) in the 14 Ferula samples. In which, the single nucleotide and dinucleotide repeats were common, which is consistent with the results of previous studies [55,60]. In general, during the evolutionary process of species, most repeated sequences in the genome are distributed in the non-coding region and retain as little genetic information as possible to improve its genetic efficiency. Therefore, repeat sequences play an important role in species evolution [61][62][63]. The repeats found in the 12 analyzed species indicate genetic variation among the Ferula species. In addition, we also observed that the poly (A/T) SSRs were typically most common, while poly (C/G) repeats were extremely rare. These results are consistent with those of a previous study and verify the hypothesis that cpSSRs generally consist of short polyadenine (polyA) or polythymine (polyT) repeats and rarely contain tandem guanine (G) or cytosine (C) repeats [64][65][66].
Divergent hotspots play a significant role in species identification and phylogenetic information. Moreover, IR regions often show lower sequence divergence than SSC and LSC regions [67], this probably due to higher mutation rates lead to rapid genome evolution compared to other regions [68]. In our study, this phenomenon was evident that the SSC area showed the maximum nucleotide diversity followed by the LSC region, and the IR regions had the lowest Pi value (Fig. 5, Table S5). And rps16/trnQ-UUG , trnS-UGA /psbZ, psbH/petB, ycf1/ndhF, rpl32, ycf1 were detected as the most divergent regions (Pi > 0.006) across all tested plastomes, suggesting that these variable loci can be used as important references and potential molecular markers for future studies on the evolution and diversity in Ferula. Generally, the Ka/Ks ratio is used to divide genes into positive selection, neutral evolution, and purification, with a limit of one [69]. Previously studies indicated that Ka/ Ks ratios mostly are lower due to synonymous nucleotide substitutions rates that occur more often compared to nonsynonymous substitutions rates [70]. The genes with the highest Ka/Ks variability can be used as candidate barcodes to diferentiate species and in the future applied to perform phylogenetic and phylogeographic analyses [71]. Our study suggests that 76 common protein-coding genes were under purifying selection, which indicates the typical evolutionary conservation of plant plastid genes [55,72,73], and three genes (ccsA, ndhC, ycf2) were under weak positive selection (Table S6), ycf2 have been proved to be pseudogenized in many studies [74] and ccsA was located in one of the most divergent regions, possibly as a discriminating DNA barcode for Ferula species.

The relationships between Soranthus, Schumannia, Talassia and Ferula
Based on the anatomical morphological characteristics of sclerosing cell layers in the mesocarp, the genera Soranthus, Schumannia and Talassia have been proposed to be located under the genus Ferula [25,75], all of which are recognized in the Flora of China [76]. It is easily distinguished Ferula from Soranthus and Schumannia by gross morphology and inflorescence structure, combined with the presence of luteolin 7-glycosides in the leaves, that seems reasonable to combine the two genera into Soranthus [77]. Also, Talassia tends to be incorporated into Ferula because insignificant morphological differences, although a large extent similarity between T. transiliensis and F. conocaula in the spectrum of leaf flavonoids [77]. Through a comparative study of plant external morphology, fruit anatomy, and pollen morphology, Qin and Shen [27] suggest that Talassia should be an independent genus and agreed to combine the other two monotypic genera. However, the above four genera have been suggested to merge into one genus according to the presence or absence of coumarins [78]. Recentlly, the molecular phylogeny of Ferula constructed Kurzyna-Młynik et al. [7] and Panahi et al. [16,29] based on nrDNA ITS and cpDNA sequences (the rps16 intron, the rpoC1 intron and the rpoB-trnC) indicated that Soranthus, Schumannia, and Talassia were embedded in Ferula with low support values. In our study, 15 sequences (including S. meyeri, S. karelinii and T. transiliensis) covered all of the branches except the subgenera Ferula (including section Ferula and section Stenocarpa) according to the latest Ferula phylogenetic tree [16]. Our results show that all those three species representing the genera Soranthus, Schumannia and Talassia were embedded in Ferula based on phylogenetic trees with high bootstrap values (Fig. 6). The species S. meyeri and S. karelinii were clustered into section Soranthus (PP = 1, BS = 100%), and T. transiliensis and F. renardii clustered into section Glaucoselinum (PP = 1, BS = 100%), which was coincident with Panahi [16] while with higher support values. Therefore, we support the standpoint of sinking Soranthus Ledeb., Schumannia Kuntz., Talassia Korovin into synonymy of Ferula L.

Plastomes might provide new insight on phylogenetic relationships in Ferula
As one of a complex taxonomic genus within Apiaceae, the system of Ferula is paid attention at the morphological and molecular levels [7,[9][10][11][13][14][15][16]. All those efforts on taxonomic systems have contributed greatly to understanding of the genus Ferula. Kurzyna-Młynik et al. [7] published the first molecular phylogeny for Ferula to solve the relationship among Dorema, Ferula and Leutea, in which nrDNA ITS sequences were used to construct a phylogenetic tree revising Dorema and Leutea to Ferula and transferring Ferula to Scandiceae from Peucedaneae. Later, nrDNA ITS sequences and three fragments of cpDNA (the rps16 intron, the rpoC1 intron and the rpoB-trnC) were used to explore the relationship among the three genera, and it was found that Dorema was incorporated into Ferula and Leutea independently [16,29]. Although these results provide an important foundation for the identification and classification of Ferula species, all previous studies have been based on relatively short sequences with low support values owing to the relatively limited number of nuclear/chloroplast genes. In addition, nrDNA and plastid DNA are highly incongruent, and intense reticulate evolution in Ferula means that proposing an unambiguous hierarchical classification system is almost impossible [16]. Furthermore, many species of Ferula have not been specifically addressed, and many only broadly grouped into branches.
Notably, studies based on plastomes can provide new insights into the phylogenetic relationships between species. For example, Clerodendranthus spicatus is closely related to two Lamiacea species, Tectona grandis L.f. and Glechoma longituba (Nakai) Kuprian. [79]; Juglandaceae is monophyletic, and Carya cathayensis Sarg. is a sister to C. kweichowensis Kuang & A.M.Lu and C. illinoinensis (Wangenh.) K.Koch [66]; and Fagus longipetiolata Seemen and F. engleriana Seemen ex Diels form a close relationship [41]. Here, we performed phylogenetic analyses for Ferula and other genera of Apiaceae using complete plastomes, and we recognized Ferula as a monophyletic group with the integration of Soranthus, Schumannia, Talassia (PP = 1, BS = 100%). Within Ferula, we recovered three main lineages in agreement with Panahi et al. [16], who proposed a new classification based on morphological characteristics and sequence data (nrDNA ITS sequences and three cpDNA fragments). This classification divides Ferula into four subgenera and 10 sections. In addition, Caucalis, Daucus, Cuminum, and Anthriscus were all typical of Scandiceae and formed a monophyletic system with Ferula. This provides strong evidence and support for the transfer of Ferula from the Peucedaneae to the Scandiceae [7]. However, we also observed some differences. When added into Panahi et al. 's phylogenetic tree, F. sinkiangensis was clustered into the Scorodosma branch with the sister species F. kelifi. Based on our results, F. sinkiangensis is separated from F. kelifi, being clustered with F. litwinowiana in the Merwia branch. Further research is needed to confirm this phenomenon. Overall, our work demonstrates that plastome studies can provide highly useful information for future phylogenetic, taxonomic, and evolutionary studies of Ferula.

Conclusion
We obtained 14 complete cp genome sequences from 12 Ferula species (including Soranthus, Schumannia, and Talassia) and compared them based on genome structure, gene content, and gene sequences. Some hotspots in the LSC and SSC regions were identified, which may provide useful markers for phylogenetic analysis. Notably, the Gene ccsA can be used as a DNA barcode for Ferula species. Our phylogenetic analysis showed a tight connection between Soranthus Ledb., Schumannia Kuntz., Talassia Korov., and Ferula L., indicating that treatment as separate genera is unreasonable. Instead, their phylogenetic relationship, which is now well resolved, strongly supports that they can be considered synonymous with Ferula. This new genomic information not only contributes to the better development and utilization of Ferula but also provides a basis for further understanding the evolutionary, genetic, and phylogenetic relationships of this important genera.

Plant materials and DNA extraction
Fourteen samples were collected from the field and herbaria (Table S1). Of these, five specimens were taken from the specimen museum of the Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences (XJBI), one was obtained from the Komarov Botanical Institute of RAS (LE), five specimens were taken from the National Herbarium of Uzbekistan (TASH), and three were collected from the field in Tajikistan. Leaf samples were dried in silica gel and stored at -20 °C for DNA extraction. DNA extraction was performed using a plant genome extraction kit (DP320) from Tiangen Biochemical Technology (Beijing) according to the manufacturer's instructions.

DNA sequencing and genome assembly and annotation
The extracted DNA was sent to a sequencing company for automatic sequencing using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England BIolabs) [80]. DNA extracts were quantified and sheared into approximately 500 base pair (bp) fragments for library construction using standard protocols (NEBNext Ultra IITMDNA Library Prep Kit for Illumina). Pairedend sequencing from both ends of 150 bp fragments was performed on the Illumina HiSeq X Ten platform at the Molecular Biology Experiment Center, Germplasm Bank of Wild Species in Southwest China, to generate no less than 2 GB data for each individual.

Codons, repeat sequences, and simple sequences repeat analysis
The protein-coding genes were extracted for codon analysis. The final dataset included 86 protein-coding genes from each species. Codon usage and relative synonymous codon usage (RSCU) values were calculated using JSHY-Cloud (http:// cloud. genep ioneer. com: 9929). A heatmap of all the RSCU values of the 14 plastomes was produced using ClustVis [87]. Using the parameters of a Hamming distance of 3, a minimum repeat size of 30 bp, and a maximum repeat size of 5,000 bp, REPuter was used to identify the size and location of four types of repeat sequences (i.e., forward, palindromic, reverse, and complement) [88]. Simple sequence repeats (SSRs) were detected using the online MISA software (http:// pgrc. ipkga tersl eben. de/ misa/ misa. html) with minimum repeat number settings of 10, 5, 4, 3, 3, and 3 for mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides, respectively.

Genome comparison with other Ferula species and selective pressure analysis
Sequence divergence among the 14 chloroplast (cp) genomes was compared using Mafft (version 7.0) [89], IRscope (https:// irsco pe. shiny apps. io/ irapp/) and Mauve [90]. DnaSP [91] was used to calculate nucleotide divergence values using the sliding window method, with a window length of 800 bp and a step size of 200 bp. Selective pressures were analyzed for 79 common proteincoding genes among 15 Ferula species (including one published plastome). The ratio of nonsynonymous to synonymous nucleotide substitution rates (Ka/Ks) was calculated using DnaSP.

Phylogenetic analysis
We used 25 complete plastome sequences to infer the phylogenetic relationships of Ferula. After comparison with Mafft, Trimal [92], and Phylosuite [93] were used to trim areas with poor quality. The phylogenetic tree was then constructed using RaxML-HPC v.8 [94] and the maximum likelihood method with 1,000 replicates and the GTRGAMMA model. After screening for the best model using jModelTest2 [95], MrBayes 3.2.7a [96] was used to construct a Bayes tree, and the selected models for the complete plastome sequences in BI analyses were TPM1uf + I + G, and iTOL [97] and FigTree 1.4.2 [98] were used to construct the phylogenetic tree.